Tokenization
Tokenization is the process of breaking text down into smaller units called tokens, the basic building blocks that language models use to understand and generate language. In the context of AI and large language models (LLMs), a token may be a single character, a sub-word fragment, or an entire word, depending on the language and the tokenizer used.
Why Tokenization Matters
Language models such as GPT-3 and GPT-4 do not process raw text directly. Instead, they convert text into tokens, which are then mapped to numerical representations that the model can process. The number of tokens in a prompt or conversation determines how much information the model can consider at once (the context window).
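As a concrete, minimal sketch of this text-to-numbers step, the snippet below uses OpenAI's tiktoken library to encode a string into token IDs and count them. The choice of the cl100k_base encoding and the sample sentence are assumptions made for illustration.

```python
import tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models
# (chosen here as an assumption for this sketch).
encoding = tiktoken.get_encoding("cl100k_base")

text = "Tokenization turns text into numbers a model can process."
token_ids = encoding.encode(text)   # text -> list of integer token IDs

print(token_ids)                    # one integer per token
print(len(token_ids), "tokens")     # this count is what fills the context window
print(encoding.decode(token_ids))   # IDs -> text, round-tripped back
```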
How Tokenization Works
- Splitting Text: The tokenizer splits input text into tokens based on rules or learned patterns. For example, the word "tokenization" might be split into ["token", "ization"] or kept as a single token, depending on the tokenizer.
- Mapping to IDs: Each token is mapped to a unique integer ID, which is used as input to the model.
- Handling Special Characters: Tokenizers handle punctuation, spaces, and special characters with explicit rules so that any input string can be represented (the sketch after this list walks through these steps).
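To make these steps concrete, here is a minimal sketch, again assuming tiktoken and the cl100k_base encoding, that prints each token's integer ID next to the text fragment it represents:

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding for this sketch

for token_id in encoding.encode("tokenization, e.g."):
    # decode_single_token_bytes returns the raw bytes behind one token;
    # some tokens cover partial UTF-8 sequences, hence errors="replace".
    piece = encoding.decode_single_token_bytes(token_id).decode("utf-8", errors="replace")
    print(f"{token_id:>6}  {piece!r}")
```

The output makes the splitting and ID-mapping steps visible at once: how the word is broken into pieces, how the punctuation is handled, and which integer stands for each piece.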
Examples
| Text Input | Tokens Generated | Number of Tokens |
|---|---|---|
| Hello, world! | ["Hello", ",", " world", "!"] | 4 |
| Artificial Intelligence | ["Artificial", " Intelligence"] | 2 |
| GPT-4 is amazing. | ["GPT", "-", "4", " is", " amazing", "."] | 6 |

(Exact splits vary by tokenizer; note that leading spaces are part of the tokens, as in " world" and " is".)
Tokenization and Model Limits
The context window of a model is measured in tokens, not characters or words. For example, a model with a 4,096-token limit can handle at most 4,096 tokens in a single request, typically counting both the prompt and the generated response. If a conversation exceeds this limit, the input must be truncated, and usually the oldest tokens are the ones dropped.
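A common safeguard is to trim input to a fixed token budget before sending it. The helper below is a hypothetical sketch (the function name truncate_to_budget and the default budget are assumptions); real systems usually trim at message or sentence boundaries rather than at raw token offsets.

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding

def truncate_to_budget(text: str, max_tokens: int = 4096) -> str:
    """Keep at most max_tokens tokens, dropping the oldest (leading) ones."""
    token_ids = encoding.encode(text)
    if len(token_ids) <= max_tokens:
        return text
    # Keep the most recent tokens, mirroring how chat history is trimmed.
    return encoding.decode(token_ids[-max_tokens:])

print(truncate_to_budget("a very long conversation history ...", max_tokens=8))
```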
Practical Tips
- When preparing prompts for LLMs, be aware of token limits to avoid losing important context.
- Use online tools or libraries (like OpenAI's tiktoken) to estimate token counts for your text.
- Remember that tokenization can split words in unexpected ways, especially for rare or compound words (see the sketch after this list).
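The last tip is easy to verify directly. The sketch below, with arbitrarily chosen example words, shows how common words often map to a single token while rare or compound words are split into several pieces (exact splits depend on the encoding):

```python
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")  # assumed encoding

for word in ["cat", "tokenizer", "antidisestablishmentarianism"]:
    pieces = [
        encoding.decode_single_token_bytes(tid).decode("utf-8", errors="replace")
        for tid in encoding.encode(word)
    ]
    print(f"{word!r} -> {pieces} ({len(pieces)} tokens)")
```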
Tokens
[Image: sample text with each token highlighted in a different color.]